These are the answers to the exercises, by Junmar Sales II

Email:

Dated 08 October 2020.

You can also see this on my GitHub Pages: MEE 621 - Activity 3 on Exploratory Data Analysis

This report outlines the answers to the exercises in Activity 3 on Exploratory Data Analysis, all done in RStudio.

Let’s load the prerequisite libraries; we assume the packages have already been installed.

library("tidyverse")
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

7.3.4 Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

First, we make sure the diamonds data, which ships with the tidyverse, is available. We can compute summary statistics for these variables and plot their distributions.

summary(select(diamonds, x, y, z))
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.05)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.05)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.05)

filter(diamonds, x == 0 | y == 0 | z == 0)
diamonds %>%
  arrange(desc(x)) %>%
  head()
diamonds %>%
  arrange(desc(y)) %>%
  head()
diamonds %>%
  arrange(desc(z)) %>%
  head()
ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()

ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()

ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()

filter(diamonds, x > 0, x < 20) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.02) +
  scale_x_continuous(breaks = 1:20)

filter(diamonds, y > 0, y < 20) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.02) +
  scale_x_continuous(breaks = 1:20)

filter(diamonds, z > 0, z < 20) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.02) +
  scale_x_continuous(breaks = 1:20)

summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))

From the summaries and plots: x and y have similar magnitudes and are larger than z, which suggests that x and y are the length and width and z is the depth. There are also impossible values of 0 and a few extreme outliers, which are likely data-entry errors.

  2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 12, center = 0)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)

diamonds %>%
  mutate(ending = price %% 10) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)

diamonds %>%
  mutate(ending = price %% 100) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

diamonds %>%
  mutate(ending = price %% 1000) %>%
  filter(ending >= 500, ending <= 800) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

With a small binwidth, a gap appears: there are no diamonds priced around $1,500. The last digits of prices are also not uniformly distributed.

  3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  count(carat)
diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)
## # A tibble: 21 x 2
##    carat     n
##    <dbl> <int>
##  1  0.9   1485
##  2  0.91   570
##  3  0.92   226
##  4  0.93   142
##  5  0.94    59
##  6  0.95    65
##  7  0.96   103
##  8  0.97    59
##  9  0.98    31
## 10  0.99    23
## 11  1     1558
## 12  1.01  2242
## 13  1.02   883
## 14  1.03   523
## 15  1.04   475
## 16  1.05   361
## 17  1.06   373
## 18  1.07   342
## 19  1.08   246
## 20  1.09   287
## 21  1.1    278

There are far more 1-carat diamonds (1,558) than 0.99-carat diamonds (23). The likely cause is rounding: a 1-carat diamond commands a premium, so diamonds near the threshold are cut (or reported) as 1 carat rather than 0.99.

  4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14714 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).

coord_cartesian() zooms after the statistics are computed, so the bins are unchanged and bars at the edges are simply clipped (half a bar can show). xlim() and ylim() instead drop the data outside the limits before binning, hence the warnings above; the histogram is recomputed on the remaining data, and bars that fall only partly inside the limits are removed entirely.

7.4.1 Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

In a histogram, missing values are dropped (with a warning), since NA has no position on a continuous axis. In a bar chart, NA is treated as just another category and gets its own bar.

  2. What does na.rm = TRUE do in mean() and sum()?

mean(c(0, 1, 2, NA), na.rm = TRUE)
## [1] 1
sum(c(0, 1, 2, NA), na.rm = TRUE)
## [1] 3
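For contrast, with the default na.rm = FALSE the NA propagates and both functions return NA:

```r
# With the default na.rm = FALSE, any NA in the input propagates to the result.
mean(c(0, 1, 2, NA))
## [1] NA
sum(c(0, 1, 2, NA))
## [1] NA
```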

7.5.1.1 Exercises

Let’s load the library

library("nycflights13")

  1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

Looks like while non-cancelled flights happen at similar frequency in mornings and evenings, cancelled flights happen at a greater frequency in the evenings.

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x=sched_dep_time, y=..density..)) + 
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = .25)+
  xlim(c(5,25))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 4 row(s) containing missing values (geom_path).

Let’s look at the same plot but smooth the distributions to make the pattern easier to see.

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x=sched_dep_time)) + 
  geom_density(mapping = aes(fill = cancelled), alpha = 0.30)+
  xlim(c(5,25))
## Warning: Removed 1 rows containing non-finite values (stat_density).

Let’s also try a boxplot to see how it looks.

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))

  2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

carat is the most important variable for predicting the price of a diamond.

cor(diamonds$price, select(diamonds, carat, depth, table, x, y, z))
##          carat      depth     table         x         y         z
## [1,] 0.9215913 -0.0106474 0.1271339 0.8844352 0.8654209 0.8612494

A fair cut tends to be associated with a higher carat. So while lower-quality diamonds may sell for more, that is driven by the carat of the diamond (the most important factor in price), which the difference in cut quality cannot offset.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), orientation = "x")

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

ggplot(data = diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

ggplot(data = diamonds, aes(x = cut, y = carat))+
  geom_boxplot()+
  coord_flip()

  3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

# Remember to install with install.packages("ggstance")
library("ggstance")
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy), orientation = "y")
## Warning: Ignoring unknown parameters: orientation

  4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

library("lvplot")
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()

Perhaps a helpful way to understand this is to see what it looks like at different specified ‘k’ values. You can see the letters when you add fill = ..LV.. to the aesthetic.

diamonds %>% 
  ggplot()+
  lvplot::geom_lv(aes(x = cut, y = price, alpha = ..LV..), fill = "blue")+
  scale_alpha_discrete(range = c(0.7, 0))
## Warning: Using alpha for a discrete variable is not advised.

diamonds %>% 
  ggplot()+
  lvplot::geom_lv(aes(x = cut, y = price, fill = ..LV..))

The letters represent the ‘median’, ‘fourths’, ‘eighths’, and so on.
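As a rough sketch of what those letters mean (assuming the usual letter-value definitions, and noting that letter values are depth-based rather than exact quantiles), the first few roughly correspond to these quantiles of price:

```r
library(ggplot2)  # for the diamonds data

# Rough correspondence: M (median) splits the data in half, the F letters
# (fourths) split each half again, and the E letters (eighths) split again.
quantile(diamonds$price, c(0.125, 0.25, 0.5, 0.75, 0.875))
```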

  5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

I produce plots for these three methods below. The geom_freqpoly() is better for look-up: given a price, it is easy to tell which cut has the highest density. However, the overlapping lines make it difficult to distinguish how the overall distributions relate to each other. The geom_violin() and facetted geom_histogram() have similar strengths and weaknesses. It is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.). However, since we can’t easily compare the vertical values of the distributions, it is difficult to look up which category has the highest density for a given price. All of these methods depend on tuning parameters to determine the level of smoothness of the distribution.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  facet_wrap(~cut, ncol = 1, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() +
  coord_flip()

ggplot(diamonds,aes(x = cut, y = carat))+
  geom_violin()

ggplot(diamonds, aes(x = carat, y = ..density..))+
  geom_histogram()+
  facet_wrap(~cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I like how geom_freqpoly overlays the distributions directly, but it can be tough to read, and the lines can overlap and be hard to tell apart. You also have to specify density explicitly for it and for geom_histogram, whereas geom_violin shows the density by default. The tails in geom_violin can be easy to read, but all categories share a common scale, whereas by faceting geom_histogram and setting scales = "free" you can have independent scales. I think the biggest advantage of the histogram is that it is the most familiar, so people will know what they’re looking at.

  6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

There are two methods:

geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.

geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points. I’ll use the mpg box plot example; since these methods display individual points, they are better suited for smaller datasets.

library(ggbeeswarm)
ggplot(mpg, aes(x = displ, y = cty, color = drv))+
  geom_point()  

ggplot(mpg, aes(x = displ, y = cty, color = drv))+
  geom_jitter()

ggplot(mpg, aes(x = displ, y = cty, color = drv))+
  geom_beeswarm()
## Warning in f(...): The default behavior of beeswarm has changed in version
## 0.6.0. In versions <0.6.0, this plot would have been dodged on the y-axis. In
## versions >=0.6.0, grouponX=FALSE must be explicitly set to group on y-axis.
## Please set grouponX=TRUE/FALSE to avoid this warning and ensure proper axis
## choice.

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

ggplot(mpg, aes(x = displ, y = cty, color = drv))+
  geom_quasirandom()
## Warning in f(...): The default behavior of beeswarm has changed in version
## 0.6.0. In versions <0.6.0, this plot would have been dodged on the y-axis. In
## versions >=0.6.0, grouponX=FALSE must be explicitly set to group on y-axis.
## Please set grouponX=TRUE/FALSE to avoid this warning and ensure proper axis
## choice.

geom_jitter() is similar to geom_point() but adds random noise to the points; you can control the amount with the width and height arguments. This is valuable because it lets you see points that would otherwise overlap one another. geom_beeswarm() adds variation in a uniform pattern, by default only along the x-axis. geom_quasirandom() also defaults to spreading points along the x-axis, but produces quasi-random variation: ‘quasi’ because the points appear to follow some interrelationship, and because running the plot multiple times yields exactly the same plot, whereas geom_jitter() gives a slightly different plot each time. To see the differences between geom_beeswarm() and geom_quasirandom(), it’s helpful to look at the plots above, but holding the y value constant at 1.
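A minimal sketch of that idea: with a constant y, any visible spread comes from the geom itself rather than the data.

```r
library(ggplot2)
library(ggbeeswarm)

# 200 identical observations: all points share the same (x, y) position,
# so the horizontal spread in each plot is produced entirely by the geom.
constant <- data.frame(x = "a", y = rep(1, 200))

ggplot(constant, aes(x = x, y = y)) + geom_jitter()      # random scatter, differs each run
ggplot(constant, aes(x = x, y = y)) + geom_beeswarm()    # points packed into a regular swarm
ggplot(constant, aes(x = x, y = y)) + geom_quasirandom() # deterministic quasi-random spread
```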

ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )

7.5.2 Exercises

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

Proportion of each cut within color (change group_by(color, cut) to group_by(cut, color) to set up the converse):

cut_in_color_graph <- diamonds %>% 
  group_by(color, cut) %>% 
  summarise(n = n()) %>% 
  mutate(proportion_cut_in_color = n/sum(n)) %>%
  ggplot(aes(x = color, y = cut))+
  geom_tile(aes(fill = proportion_cut_in_color))+
  labs(fill = "proportion\ncut in color")
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
cut_in_color_graph

Similarly, to scale by the distribution of color within cut,

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop))

I add limits = c(0, 1) to put the color scale between 0 and 1, the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it compresses the colors and makes it harder to see differences within this dataset. Using the default limits (the minimum and maximum observed values) emphasizes relative differences within the dataset, but makes comparisons across datasets harder.
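For example, pinning the fill scale to the full range of a proportion:

```r
library(tidyverse)

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  # fix the scale to the logical bounds of a proportion
  scale_fill_continuous(limits = c(0, 1))
```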

  2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")
## `summarise()` regrouping output by 'month' (override with `.groups` argument)

There are several things that could be done to improve it:

  - sort destinations by a meaningful quantity (distance, number of flights, average delay)
  - remove missing values

How to treat missing values is difficult. In this case, missing values correspond to airports which don’t have regular flights (at least one flight each month) from NYC. These are likely smaller airports (with higher variance in their average due to fewer observations). When we group all pairs of (month, dest) again by dest, we should have a total count of 12 (one for each month) per group (dest). This makes it easy to filter.

I improved the original graph by adding a filter so that only destinations that received over 10,000 flights were included:

flights %>% 
  group_by(dest, month) %>% 
  summarise(delay_mean = mean(dep_delay, na.rm = TRUE), 
            n = n()) %>% 
  # after summarise() the data are grouped by dest, so this totals flights per destination
  mutate(sum_n = sum(n)) %>% 
  filter(sum_n > 10000) %>% 
  ggplot(aes(x = as.factor(month), y = dest, fill = delay_mean)) +
  geom_tile()
## `summarise()` regrouping output by 'dest' (override with `.groups` argument)

Another way to improve the plot may be to group the destinations into regions, which would also avoid filtering out data. We aren’t given region information, but we do have the latitude and longitude of each airport in the airports dataset.
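We aren’t told how to define regions, so as an illustration only, here is a crude longitude split (the -90 threshold is an arbitrary assumption, not part of the exercise):

```r
library(tidyverse)
library(nycflights13)

flights %>%
  group_by(dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # airports$faa matches flights$dest; lat/lon give each airport's location
  inner_join(airports, by = c("dest" = "faa")) %>%
  mutate(region = if_else(lon < -90, "west", "east")) %>%  # crude, illustrative split
  group_by(region) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))
```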

  3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

If you’re comparing the proportion of cut within color and want to see how a specific cut’s proportion changes, it may be easier to read left to right than bottom to top. Compare the two plots below.

cut_in_color_graph

cut_in_color_graph+
  coord_flip()

It’s usually better to use the categorical variable with a larger number of categories or the longer labels on the y axis. If at all possible, labels should be horizontal because that is easier to read.

7.5.3.1 Exercises

  1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.

If categorical colors are used, no more than eight colors should be used in order to keep them distinct. Using cut_number, I will split carats into quantiles (five groups).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You should keep in mind how many lines you are going to create, they may overlap each other and look busy if you’re not careful.

smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(smaller, aes(x = price)) +
  geom_freqpoly(aes(colour = cut_number(carat, 10)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For the visualization below I wrapped the plot in plotly::ggplotly(). This function wraps your ggplot in HTML so that you can do things like hover over the points.

p <- ggplot(smaller, aes(x=price))+
  geom_freqpoly(aes(colour = cut_width(carat, 0.25)))

plotly::ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  2. Visualise the distribution of carat, partitioned by price.

ggplot(diamonds, aes(x = price, y = carat))+
  geom_violin(aes(group = cut_width(price, 2500)))

The same relationship plotted with box plots: 10 bins, each containing an equal number of observations, so the bin widths vary with the density of the data.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

  3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

diamonds %>% 
  mutate(percent_rank = percent_rank(carat),
         small = percent_rank < 0.025,
         large = percent_rank > 0.975) %>% 
  filter(small | large) %>% 
  ggplot(aes(large, price)) +
  geom_violin()+
  facet_wrap(~large)

Small diamonds have a left-skewed price distribution, while large diamonds have a right-skewed price distribution.

However, the distribution of very large diamonds is more variable. I am not surprised, since I knew little about diamond prices. After the fact, it does not seem surprising (as many things do). I would guess that this is due to the way in which diamonds are selected for retail sales. Suppose that someone selling a diamond only finds it profitable to sell if some combination of size, cut, clarity, and color is above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional in all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. However, larger diamonds may be profitable regardless of the values of the other factors. Thus we observe large diamonds with a wider variety of cut, clarity, and color, and thus more variability in prices.

  4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1)

ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot()

ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot()

ggplot(diamonds, aes(x = carat, y = price))+
  geom_jitter(aes(colour = cut), alpha = 0.2)+
  geom_smooth(aes(colour = cut))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(x = carat, y = price))+
  geom_boxplot(aes(group = cut_width(carat, 0.5), colour = cut))+
  facet_grid(. ~ cut)

diamonds %>% 
  mutate(carat = cut(carat, 5)) %>% 
  ggplot(aes(x = carat, y = price))+
  geom_boxplot(aes(group = interaction(cut_width(carat, 0.5), cut), fill = cut), position = position_dodge(preserve = "single"))

  5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Why is a scatterplot a better display than a binned plot for this case?

Binned plots give less precise value estimates at each point (constrained by the granularity of the binning), so outliers do not show up as clearly. They also show relationships in the data less precisely. The level of variability (at least with boxplots) can also be harder to intuit. For example, let’s look at the plot below as a binned boxplot.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut_width(x, 1), y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))